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Abstract 


Infinite Hidden Markov Models (iHMM’s) are an attractive, nonparametric generaliza¬ 
tion of the classical Hidden Markov Model which can automatically infer the number of 
hidden states in the system. However, due to the infinite-dimensional nature of transi¬ 
tion dynamics performing inference in the iHMM is difficult. In this paper, we present an 
infinite-state Particle Gibbs (PG) algorithm to resample state trajectories for the iHMM. 

The proposed algorithm uses an efficient proposal optimized for iHMMs and leverages an¬ 
cestor sampling to suppress degeneracy of the standard PG algorithm. Our algorithm 
demonstrates significant convergence improvements on synthetic and real world data sets. 
Additionally, the infinite-state PG algorithm has linear-time complexity in the number of 
states in the sampler, while competing methods scale quadratically. 

Keywords: Bayesian Nonparametrics, Hidden Markov Models, Model Selection, Particle 
MCMC 

1. Introduction 

Hidden Markov Models (HMM’s) are among the most widely adopted latent-variable models 
used to model time-series datasets in the statistics and machine learning communities. They 
have also been successfully applied in a variety of domains including genomics, language, 
and finance where sequential data naturally arises (Rabiner, 1989; Bishop, 2006). 

One possible disadvantage of the finite-state space HMM framework is that one must 
a-priori specify the number of latent states K. Standard model selection techniques can be 
applied to the finite state-space HMM but bear a high computational overhead since they 
require the repetitive training/exploration of many HMM’s of different sizes. 

Bayesian nonparametric methods offer an attractive alternative to this problem by 
adapting their effective model complexity to fit the data. In particular, Beal et al. (2001) 
constructed an HMM over a countably infinite state-space using a Hierarchical Dirichlet 
Process (HDP) prior over the rows of the transition matrix. Various approaches have been 
taken to perform full posterior inference over the latent states, transition/emission distri¬ 
butions and hyperparameters since it is impossible to directly apply the forward-backwards 
algorithm due to the infinite-dimensional size of the state space. The original Gibbs sam¬ 
pling approach proposed in Teh et al. (2006) suffered from slow mixing due to the strong 
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correlations between nearby time steps often present in time-series data (Scott, 2002 ). 
However, Van Gael et al. (2008) introduced a set of auxiliary slice variables to dynamically 
“truncate” the state space to be finite (referred to as beam sampling), allowing them to use 
dynamic programming to jointly resample the latent states thus circumventing the problem. 
Despite the power of the beam-sampling scheme. Fox et al. (2008) found that application 
of the beam sampler to the (sticky) iHMM resulted in slow mixing relative to an inexact, 
blocked sampler due to the introduction of auxiliary slice variables in the sampler. 

The main contributions of this paper are to derive an infinite-state PG algorithm for 
the iHMM using the stick-breaking construction for the HDP, and constructing an opti¬ 
mal importance proposal to efficiently resample its latent state trajectories. The proposed 
algorithm is compared to existing state-of-the-art inference algorithms for iHMMs, and em¬ 
pirical evidence suggests that the infinite-state PG algorithm consistently outperforms its 
alternatives. Furthermore, by construction the time complexity of the proposed algorithm 
is 0{TNK). 

Here T denotes the length of the sequence, N denotes the number of particles in the PG 
sampler, and K denotes the number of “active” states in the model. Despite the simplicity 
of sampler, we find in a variety of synthetic and real-world experiments that these particle 
methods dramatically improve convergence of the sampler, while being more scalable. 

We will first define the iHMM and sticky iHMM, reviewing the the Dirichlet Process 
(DP) and Hierarchical Dirichlet Process (HDP) in our appendix, in Section 2. Then we 
move onto to the description of MGMC sampling scheme in Section 3. In Section 4 we 
present our results on a variety of synthetic and real-world datasets. 

2. Model and Notation 

2.1. Infinite Hidden Markov Models 

We can formally define the iHMM (we review the theory of the HDP in our appendix) as 
follows: 


/3 ~ GEM( 7 ), 

7r,|/3^^^DP(a,/3), H, i = l,...,oo (1) 

st\st-i ~/(•|(/>5,), t = l,...,r. 

Here (3 is the shared DP measure defined on integers Z. Here si:T = (si,...,st) are the 
latent states of the iHMM, yi-T = {yi,---,yT) are the observed data, and parametrizes 
the emission distribution /. Usually H and / are chosen to be conjugate to simplify the 
inference. (5y can be interpreted as the prior mean for transition probabilities into state k', 
with a governing the variability of the prior mean across the rows of the transition matrix. 
The hyper-parameter 7 controls how concentrated or diffuse the probability mass of (3 will 
be over the states of the transition matrix. To connect the HDP with the iHMM, note 
that given a draw from the HDP Gk = '^kk'^ipy we identify with the transition 

probability from state k to state k' where 0 ^/ parametrize the emission distributions. 

Note that fixing /3 = (^,....,^,0,0...) implies only transitions between the first K states 
of the transition matrix are ever possible, leaving us with the finite Bayesian HMM. If we 
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Figure 1: Graphical Model for the sticky HDP-HMM (setting k = 0 recovers the HDP- 
HMM) 


define a finite, hierarchical Bayesian HMM by drawing 

f3 ~ T>ir{'y/K, ...,^/K) 

TTfc ~ Vir{a(3) 

with joint density over the latent/hidden states as 

P<l>{si:T,yi-.T) = Ilf^i7T{st\st-l)U{yt\st) 

then after taking K ^ oo, the hierarchical prior in Equation (2) approaches the HDP. 


2.2. Prior and Emission Distribntion Specification 

The hyperparameter a governs the variability of the prior mean across the rows of the 
transition matrix and 7 controls how concentrated or diffuse the probability mass of (3 will 
be over the states of the transition matrix. However, in the HDP-HMM we have each row of 
the transition matrix is drawn as ~ DP(a,/3). Thns the HDP prior doesn’t differentiate 
self-transitions from jnmps between different states. This can be especially problematic in 
the non-parametric setting, since non-Markovian state persistence in data can lead to the 
creation of unnecessary extra states and nnrealistically, rapid switching dynamics in our 
model. In Fox et al. (2008), this problem is addressed by including a self-transition bias 
parameter into the distribution of transitioning probability vector iVj: 


TT j DP(q: -|- k, 


af3 -|- ndj 
a + K 


) 


( 3 ) 


to incorporate prior beliefs that smooth, state-persistent dynamics are more probable. Such 
a construction only involves the introduction of one further hyperparameter k which controls 
the “stickiness” of the transition matrix (note a similar self-transition was explored in Beal 
et al. ( 2001 )). 

For the standard iHMM, most approaches to inference have placed vague gamma hyper¬ 
priors on the hyper-parameters a and 7 , which can be resampled efficiently as in Teh et al. 
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(2006). Similarly in the sticky iHMM, in order to maintain tractable resampling of hyper¬ 
parameters Fox et al. (2008) chose to place vague gamma priors on 7 , a + K, and a beta prior 
on K/{a + k). In this work we follow Teh et al. (2006); Fox et al. (2008) and place priors 
7 ~ Gamma(a-),, bj), a+K ~ Gamma(as, bg), and k ~ Beta(aK, b,^) on the hyper-parameters. 

We consider two conjugate emission models for the output states of the iHMM - a 
multinomial emission distribution for discrete data, and a normal emission distribution for 
continuous data. For discrete data we choose ~ Vir{a^) with /(• | = Cat(-|())fc). For 

continuous data we choose 4>k = (l^W^) ~ fi^) with f{-\(l)st) = = 

(^, 0 - 2 )). 

3. Posterior Inference for the iHMM 

Let us first recall the collection variables we need to sample: /3 is a shared DP base measure, 
(tt/s) is the transition matrix acting on the latent states, while parametrizes the emission 
distribution /, k = 1,..., K. We can then resample the variables of the iHMM in a series 
of Gibbs steps: 

Step 1: Sample Si:T\yi:T,4>l:K,P,T^l-.K- 
Step 2: Sample (3 \ si^t, 7 - 
Step 3: Sample tvi-k \ P, a, k, si-t- 
Step 4- Sample pi-x \ Ui-.T, si:T, H. 

Step 5: Sample {a,'y,K)\si:T,P,'^i-.K- 

Due to the strongly correlated nature of time-series data, resampling the latent hidden 
states in Step 1, is often the most difficult since the other variables can be sampled via 
the Gibbs sampler once a sample of si-^t has been obtained. In the following section, we 
describe a novel efficient sampler for the latent states si:^ of the iHMM, and refer the reader 
to our appendix and Teh et al. (2006); Fox et al. (2008) for a detailed discussion on steps 
for sampling variables a, 7 , k, /3, 4>1:K- 

3.1. Infinite State Particle Gibbs Sampler 

Within the Particle MCMG framework of Andrieu et al. (2010), Sequential Monte Garlo 
(or particle filtering) is used as a complex, high-dimensional proposal for the Metropolis- 
Hastings algorithm. The Particle Gibbs sampler is a conditional SMG algorithm resulting 
from clamping one particle to an apriori fixed trajectory. In particular, it is a transition 
kernel that has p(si:r|yi:T) as its stationary distribution. 

The key to constructing a generic, truncation-free sampler for the iHMM to resample the 
latent states, si:T, is to note that the finite number of particles in the sampler are “localized” 
in the latent space to a finite subset of the infinite set of possible states. Moreover, they 
can only transition to finitely many new states as they are propagated through the forward 
pass. Thus the “infinite” measure /3, and “infinite” transition matrix tt only need to be 
instantiated to support the number of “active” states (defined as being {!,..., AT}) in the 
state space. In the particle Gibbs algorithm, if a particle transitions to a state outside the 
“active” set, the objects P and tt can be lazily expanded via the stick-breaking constructions 
derived for both objects in Teh et al. (2006) and stated in equations (2), (4) and (5). Thus 
due to the properties of both the stick-breaking construction and the PGAS kernel, this 
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resampling procedure will leave the target distribution p(si:r|l/i:r) invariant. Below we first 
describe our infinite-state particle Gibbs algorithm for the iHMM then detail our notation 
(we provide further background on SMC in our supplement): 

Step 1: For iteration t = 1 initialize as: 

(a) sample ■si ~ for i G 1, ...,N 

(b) initialize weights w\ = p(.si)/i(i/i|si)/( 7 i(si) for i G 1, 

Step 2: For iteration t > 1 use trajectory from t — 1, fj, tt, cf), and K: 

(a) sample the index Ot-i ~ Cat{-\W^j_^) of the ancestor of particle f for f G 1, ...,N — 1. 

(b) sample si ~ qt{- \ for z G 1, ...,N — 1. \i s\ = K + 1 then create a new state 

using the stick-breaking construction for the HDP: 

(i) Sample a new transition probability vector ttx+i ~ Vir{a(3). 

(ii) Use stick-breaking construction to iteratively expand jd ^ [j3, I3 k+i] as: 

I^'k+i ~ Beta(l, 7 ), Pk+i = ~ f^e)- 

(iii) Expand transition probability vectors k = 1,..., K + 1, to include transi¬ 

tions to ii' -|- 1st state via the HDP stick-breaking construction as: 

'^j ^ ■ ■ ■ 7 T^j,K+l]7 Vj = 1,..., iP -|- 1. 


where 


K+l 

T^'jK+l ~ Beta(Q;o/^A') Oo(l ~ A)); '^jK+l = ~ 

i=l 


(iv) Sample a new emission parameter (^K+i ~ H. 

(c) compute the ancestor weights wl_^rp = w\_i'ii{sl\s\_i) and resample as 

P(af = i) oc uiAi|T- 

(d) recompute and normalize particle weights using: 


wt{sf) = 7:{s\\s,f-f)f{yt\s\)/qt{sl\s,f-f 
Wt{si) 


N 


Wt{s\) / C^Wt{s\)) 


2=1 


Step 3: Sample k with P(/c = z) oc wf and return = s\.rp. 

In the particle Gibbs sampler, at each step t a weighted particle system {sf wl}^i serves 
as an empirical point-mass approximation to the distribution p(si:'r), with the variables a\ 
denoting the ‘ancestor’ particles of sf Here we have used 7r(si|st_i) to denote the latent 
transition distribution, f{yt\st) the emission distribution, andp(si) the prior over the initial 
state si- 
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3.2. More Efficient Importance Proposal qt{-) 

In the PG algorithm described above, we have a choice of the importance sampling den¬ 
sity qt{-) to use at every time step. The simplest choice is to sample from the “prior” - 

qJ' • CL^ 

~ which can lead to satisfactory performance when then observa¬ 
tions are not too informative and the dimension of the latent variables are not too large. If 
we were to use the “prior” as our importance sampling density then the time-complexity of 
our sampler would be strictly 0{TN). However using the prior as importance proposal in 
particle MCMC is known to be suboptimal. In order to improve the mixing rate of the sam- 

pier, it is desirable to sample from the partial “posterior” - qt{- \ f {yA^t) 

- whenever possible . 

In general, sampling from the “posterior”, gt(-| oc TT{s2\s^^-^'^)f{yt\s^), may be 

impossible, but in the iHMM we can show that it is analytically tractable. To see this, 
note that we have lazily represented as a finite vector - 

Moreover, we can easily evaluate the likelihood f{y2\s^, for all s” G 1,..., K. However, 
= K -\- 1, we need to compute = K + 1) = f /(y^js^ = K + l,(j))H{(j))d(j). 

If / and H are conjugate, we can analytically compute the marginal likelihood of the 
X-l-lst state, but this can also be approximated by Monte Carlo sampling for non-conjugate 
likelihoods - see Neal (2000) for a more detailed discussion of this argument. Thus, we can 
compute p{yt\s'^_i) = Ydk=i I I 4'k) for each particle s" where n G 1, ..., N — 1 . 

We investigate the impact of “posterior” vs. “prior” proposals in Figure 5. Based on 
the convergence of the number of states and joint log-likelihood, we can see that sampling 
from the “posterior” improves the mixing of the sampler. Indeed, we see from the ’’prior” 
sampling experiments that increasing the number of particles from = 10 to = 50 
does seem to marginally improve the mixing the sampler, but have found A^ = 10 particles 
sufficient to obtain good results. However, we found no appreciable gain when increasing 
the number of particles from A^ = 10 to A^ = 50 when sampling from the “posterior” and 
omitted the curves for clarity. However, it is worth noting that the PG sampler does still 
perform reasonably even when sampling from the “prior” with the added advantage that 
it’s time complexity will simply be 0(TN). 

3.3. Improving Mixing via Ancestor Resampling 

It has been recognized that the mixing properties of the PG kernel can be poor due to path 
degeneracy (Lindsten et ah, 2014). A variant of PG that is presented in Lindsten et al. 
(2014) attempts to address this problem for any non-Markovian state-space model with a 
modification - resample a new value for the variable in an “ancestor sampling” step at 
every time step, which can significantly improve the mixing of the PG kernel with little 
extra computation in the case of Markovian systems. 

To understand ancestor sampling, for t >2 consider the reference trajectory s^.rp ranging 
from the current time step t to the final time T. Now, artificially assign a candidate history 
to this partial path, by connecting s[.rp to one of the other particles history up until that 
point which can be achieved by simply assigning a new value to the variable 
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af G 1, 


, N. To do this, we first compute the weights: 


W- 


t-l\T 


= W- 


t-V 


i = 1, ...,N 


(4) 


Then is sampled according to P(a^ = z) oc wl_^rp. Remarkably, this ancestor sampling 
step leaves the density p{si:T \ Vi-.t) invariant as shown in Lindsten et al. (2014) for arbitrary, 
non-Markovian state-space models. However since the infinite HMM is Markovian, we can 
show the computation of the ancestor sampling weights simplifies to 


wl-i\T = wl_iTr{si\sl_i) (5) 

Note that the ancestor sampling step does not change the 0(TNK) time complexity of the 
infinite-state PG sampler. 


3.4. Resampling tt, cj), (3, a, 7 , and k 

Our resampling scheme for tt, /3, 0, a, 7 , and n will follow straightforwardly from this 
scheme in Fox et al. (2008); Teh et al. (2006). We present a review of their methods and 
related work in our appendix for completeness. 


4. Empirical Study 

In the following experiments we explore the performance of the PG sampler on both the 
iHMM and the sticky iHMM. Note that throughout this section we have only taken iV = 10 
and = 50 particles for the PG sampler which has time complexity 0{TNK) when 
sampling from the “posterior”, and 0{TN) when sampling from the “prior”, compared to 
the time complexity of 0{TK‘^) of the beam sampler. For completeness, we also compare 
to the Gibbs sampler, which has been shown perform worse than the beam sampler (Van 
Gael et ah, 2008), due to strong correlations in the latent states. 

4.1. Convergence on Synthetic Data 

To study the mixing properties of the PG sampler on the iHMM and sticky iHMM, we 
consider two synthetic examples with strongly positively correlated latent states. First as 
in Van Gael et al. (2008), we generate sequences of length 4000 from a 4 state HMM with 
self-transition probability of 0.75, and residual probability mass distributed uniformly over 
the remaining states where the emission distributions are taken to be normal with fixed 
standard deviation 0.5 and emission means of —2.0, —0.5,1.0,4.0 for the 4 states. The base 
distribution, H for the iHMM is taken to be normal with mean 0 and standard deviation 2, 
and we initialized the sampler with ih = 10 “active” states. In the 4-state case, we see in 
Figure 2 that the PG sampler applied to both the iHMM and the sticky iHMM converges to 
the “true” value of iF = 4 much quicker than both the beam sampler and Gibbs sampler - 
uncovering the model dimensionality, and structure of the transition matrix by more rapidly 
eliminating spurious “active” states from the space as evidenced in the learned transition 
matrix plots in Figure 3. Moreover, as evidenced by the joint log-likelihood in Figure 2, we 
see that the PG sampler applied to both the iHMM and the sticky iHMM converges quickly 
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Figure 2: Comparing the performance of the 
PG sampler, PG sampler on sticky iHMM (PG- 
S), beam sampler, and Gibbs sampler on in¬ 
ferring data from a 4 state strongly correlated 
HMM. Left: Number of “Active” States K vs. 
Iterations Right: Joint-Log Likelihood vs. Iter¬ 
ations (Best viewed in color) 


PQA8 Beam Qiourd Truth 

a 


Figure 3: Learned Latent Transition Ma¬ 
trices for the PG sampler and Beam Sam¬ 
pler vs Ground Truth (Transition Ma¬ 
trix for Gibbs Sampler omitted for clar¬ 
ity). PG correctly recovers strongly cor¬ 
related self-transition matrix, while the 
Beam Sampler supports extra “spurious” 
states in the latent space. 




to a good mode, while the beam sampler has not fully converged within a 1000 iterations, 
and the Gibbs sampler is performing poorly. 

To further explore the mixing of the PG sampler vs. the beam sampler^ we consider 
a similar inference problem on synthetic data over a larger state space. We generate data 
from sequences of length 4000 from a 10 state HMM with self-transition probability of 0.75, 
and residual probability mass distributed uniformly over the remaining states, and take the 
emission distributions to be normal with fixed standard deviation 0.5 and means equally 
spaced 2.0 apart between —10 and 10. The base distribution, H, for the iHMM is also taken 
to be normal with mean 0 and standard deviation 2. The samplers were initialized with 
K = 3 and K = 30 states to explore the convergence and robustness of the infinite-state 
PG sampler vs. the beam sampler. 

As observed in Figure 4, we see that the PG sampler applied to the iHMM and sticky 
iHMM, converges far more quickly from both “small” and “large” initialization of A = 3 
and K = 30 “active” states to the true value of A = 10 hidden states, as well as converging 
in JLL more quickly. Indeed, as noted in Fox et al. (2008), the introduction of the extra 
slice variables in the beam sampler can inhibit the mixing of the sampler, since for the beam 
sampler to consider transitions with low prior probability one must also have sampled an 
unlikely corresponding slice variable so as not to have truncated that state out of the 
space. This can become particularly problematic if one needs to consider several of these 
transitions in succession. We believe this provides evidence that the infinite-state Particle 
Gibbs sampler presented here, which does not introduce extra slice variables, is mixing 
better than beam sampling in the iHMM. 

1. We no longer explore the performance of the Gibbs sampler since based on our previous experiment, and 
extensive experimentation in Van Gael et al. (2008), we believe the Gibbs sampler to be strictly worse 
than the beam sampler. 
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Figure 4: Comparing the performance of 
the PG sampler vs. beam sampler on in¬ 
ferring data from a 10 state strongly cor¬ 
related HMM with different initializations. 
Left: Number of “Active” States K from dif¬ 
ferent Initial K vs. Iterations Right: Joint- 
Log Likelihood from different Initial K vs. 
Iterations 

4.2. Ion Channel Recordings 


Figure 5: Influence of “Posterior” vs. 
“Prior” proposal and Number of Particles 
in PG sampler on iHMM. Left: Num¬ 
ber of “Active” States K from differ¬ 
ent Initial K, Numbers of Particles, and 
“Prior” / “Posterior” proposal vs. Iterations 
Right: Joint-Log Likelihood from differ¬ 
ent Initial K, Numbers of Particles, and 
“Prior”/” Posterior” proposal vs. Iterations 


For our first real dataset, we investigate the behavior of the PG sampler and beam sampler 
on an ion channel recording. In particular, we consider a IMHz recording from Rosenstein 
et al. (2013) of a single alamethicin channel previously investigated in Palla et al. (2014). 
We subsample the time series by a factor of 100, truncate it to be of length 2000, and further 
log transform and normalize it. 


Beam: Latent States PGAS: Latent States 



t t 

Figure 6: Left: Observations colored by an inferred latent trajectory using beam sampling 
inference. Right: Observations colored by an inferred latent state trajectory using PG 
inference. 

We ran both the beam and PG sampler on the iHMM for 1000 iterations (until we 
observed a convergence in the joint log-likelihood). Due to the large fluctuations in the 
observed time series, the beam sampler infers the number of “active” hidden states to be 
K = 5 while the PG sampler infers the number of “active” hidden states to be iF = 4. 
However in Figure 6, we see that beam sampler infers a solution for the latent states which 
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rapidly oscillates between a subset of likely states during temporal regions which intuitively 
seem to be better explained by a single state. However, the PG sampler has converged to 
a mode which seems to better represent the latent transition dynamics, and only seems to 
infer “extra” states in the regions of large fluctuation. Indeed, this suggests that the beam 
sampler is mixing worse with respect to the PG sampler. 

4.3. Alice in Wonderland Data 

For our next example we consider the task of predicting sequences of letters taken from 
Alice’s Adventures in Wonderland. We trained an iHMM on the 1000 characters from the 
first chapter of the book, and tested on 4000 subsequent characters from the same chapter 
using a multinomial emission model for the iHMM. 





Figure 7: Left: Gomparing the Joint Log-Likelihood vs. Iterations for the PG sampler and 
Beam sampler. Middle: Gomparing the convergence of the “active” number of states for 
the iHMM and sticky iHMM for the PG sampler and Beam sampler. Right: Trace plots of 
the number of states for different initializations for K. 

Once again, we see that the PG sampler applied to the iHMM/sticky iHMM converges 
quickly in joint log-likelihood to a mode where it stably learns a value of RT ~ 10 as evi¬ 
denced in Figure 7. Though the performance of the PG and beam samplers appear to be 
roughly comparable here, we would like to highlight two observations. Firstly, the inferred 
value of K obtained by the PG sampler quickly converges independent of the initialization 
K in the rightmost of Figure 7. However, the beam sampler’s prediction for the num¬ 
ber of active states K still appears to be decreasing and more rapidly fluctuating than 
both the iHMM and sticky iHMM as evidenced by the error bars in the middle plot in 
addition to being quite sensitive to the initialization K as shown in the rightmost plot. 
Based on the previous synthetic experiment (Section 4.1), and this result we suspect that 
although both the beam sampler and PG sampler are quickly converging to good solutions 
as evidenced by the training joint log-likelihood, the beam sampler is learning a transition 
matrix with unnecessary/spurious “active” states. Next we calculate the predictive log- 
likelihood of the Alice in Wonderland test data averaged over 2500 different realizations 
and find that the infinite-state PG sampler with iV = 10 particles achieves a predictive log- 
likelihood of —5918.4 ih 123.8 while the beam sampler achieves a predictive log-likelihood 
of —6099.0 ih 106.0, showing the PG sampler applied to the iHMM and Sticky iHMM 
learns hyperparameter and latent variable values that obtain better predictive performance 
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on the held-out dataset. We note that in this experiment as well, we have only found it 
necessary to take N = 10 particles in the PG sampler achieve good mixing and empirical 
performance, although increasing the number of particles to = 50 does improve the con¬ 
vergence of the sampler in this instance. Given that the PG sampler has a time complexity 
of 0{TNK) for a single pass, while the beam sampler (and truncated methods) have a time 
complexity of 0{TK‘^) for a single pass, we believe that the PG sampler is a competitive 
alternative to the beam sampler for the iHMM. 

5. Discussions and Conclusions 

In this work we derive a new inference algorithm for the iHMM using the particle MGMC 
framework based on the stick-breaking construction for the HDP. We also develop an efficient 
proposal inside PG optimized for iHMMs, to efficiently resample the latent state trajectories 
of the iHMM, and the sticky iHMM. The proposed algorithm is empirically compared to 
existing state-of-the-art inference algorithms for iHMMs, and the algorithm presented here 
is particularly promising because it converges more quickly and robustly to the true number 
of states in addition to obtaining better predictive performance on several synthetic and 
realworld datasets. Moreover, we argued that the PG sampler proposed here is a competitive 
alternative to the beam sampler since the time complexity of the particle samplers presented 
is 0{TNK)‘^ versus the 0{TK‘^) of the beam sampler. 

Another advantage of the proposed method is the simplicity of the PG algorithm, which 
doesn’t require truncation or the introduction of auxiliary variables, also making the algo¬ 
rithm easily adaptable to challenging inference tasks. In particular, the PG sampler can 
be directly applied to the sticky HDP-HMM with DP emission model considered in Fox 
et al. (2008) for which no truncation-free sampler exists. We leave this development and 
application as an avenue for future work. 
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Appendix A. Hierarchical Dirichlet Process 

A Dirichlet process (DP), parametrized as DP( 7 ,//), is a stochastic process whose realiza¬ 
tions are countably infinite measures: 


OO 

Gi^f) = ( 6 ) 

k=l 

over some parameter space <h. Here H is the base measure defined on the space while 7 is 
a scalar concentration parameter controlling the variability of the process around H (lower 
7 implies more variability). The weights, 13k of the DP can be sampled via a stick-breaking 
construction (Sethuraman, 1991): 

f3'^ Beta(l, 7 ), Pk = - (3'e) (7) 

referred to as /3 ~ GEM( 7 ). Importantly for our purposes, note the /3fc are defined in a 
purely recursive fashion. 

The Hierarchical Dirichlet Process (HDP) of (Teh et ah, 2006) takes a hierarchical 
Bayesian approach by defining multiple DP’s that share one random measure that is itself 
drawn from a DP. This hierarchical coupling allows one to non-parametrically model individ¬ 
ual subgroups that are generated uniquely but share some overall information. Specifically, 
we have that 


Go ~ DP( 7 , H), Gk ~ DP(a, Go) Vfc ( 8 ) 

Here a controls the variability of each Gk around the shared base measure Go, while H 
is the global base measure over the parameter space. By appealing to the stick-breaking 
construction for the DP we can express the random measures succinctly as: 

OO OO 

P ~ GEM( 7 ), Go ~ E Pk'Sfpyi Gk — '^kk'5<f)y (9) 

k'=l k'=l 

with 

k 

Tv'jk ~ Beta(ao/3fc,ao(l (10) 

i=i 


and 13k defined as before. 

Appendix B. Particle MCMC 

The key idea of the Particle Markov Chain Monte Carlo framework (PMCMC) of Andrieu 
et al. (2010) is that Sequential Monte Carlo (or particle filtering) is used as a complex, 
high-dimensional proposal for Metropolis-Hastings. The Particle Gibbs sampler results 
from using the conditional SMC algorithm, which clamps one particle to an apriori fixed 
trajectory. Crucially, the Particle Gibbs algorithm will leave the target distribution invariant 
(we refer the reader to the original paper for further technical details). 
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First we review the construction of the SMC sampler for finite-state space models. Let 
p{si:T\yi:T) denote the target density of the latent states parametrized by some 0 G 0, with 
prior p{si) over the initial state. Then let {si, wl}^^ be a weighted particle system at time 
t serving as an empirical point-mass approximation to the distribution p(si:t), with the 
variables al denoting the ancestor particles of sj. For the state-space model dynamics, we 
will use 7r(st|si_i) to denote the latent transition density, f{yt\st) the conditional likelihood, 
and p{si-T, yi-.r) the joint likelihood. 

The algorithm is initialized by sampling sj ~ Qi,e{') from a proposal density and initial¬ 
izing the importance weights as w\ = p{si)fg^i{yi\si)/qg^i{si). We can then describe the 
SMC kernel on N particles indexed as i G 1,..., 

Step 1: For iteration t = 1: 

(a) sample ~ gi,e(-) 

(b) initialize weights w\ = pisi)fg^i{yi\si)/qg^i{si) 

Step 2: For iteration t > 1 .■ 

(a) sample the index M.ult{-\W{1^g) of the ancestor of particle i for i G 1, ...,N 

Qi 

(b) sample sj ~ qt,e{-\ 

(c) recompute and normalize weights 


wt,e{s\) = TTe{sl I | s\)/qtfi{s\ \ s 


H-l J 


N 

i=l 


The Particle Gibbs sampler is similar to the SMC sampler, but conditions on the event 
that one particle in the system is constrained to a reference trajectory s\.rp = ...,s^). 

This is accomplished by only resampling for i = 1,..., — 1 in parts b) and c) above. After 

one pass of the conditional SMC algorithm, an entire trajectory is sampled as IP(si. 7 ’ = 
oc where s\.rp is constructed by tracing the ancestors of s\, back through the 
sampled trajectories. 


Appendix C. Sampling Other Variables and Related Work 

The goal of any sampling scheme for the iHMM is to sample the variables si:^, /3, 4>i:K,Oi, 7 , k. 

Building on the direct assignment sampling scheme for the HDP derived in Teh et al. 
(2006), the original Gibbs sampler took the approach of first marginalizing out the infi¬ 
nite, latent variables tt and cf) in ( 6 ). Thus we need only explicitly resample the hidden 
trajectory s, the base DP parameters /?, and hyper parameters a and 7 . Sampling 13, a, 
and 7 follows directly from the theory of the HDP, and the stick-breaking construction. To 
sample st conditional on s-t, (3, y, a, R for t G 1,..., T, we need to compute the conditional 
p{st\s-t,l3, y,a, H) oc p{yt\st, s-t,y-t, F[)p{st\s-t, P, a). The first factor is simply the condi¬ 
tional likelihood: p{yt\st, S-t,y-t, H) = f p{yt\st,4>st)p{4>st\s-t,y-t,H)d4>st, which is easily 
computed when the base distribution H is conjugate to the likelihood /. The second factor 
can be easily computed using the Markov property of the hidden state sequence. Since for 
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each t G 1, ...,T we compute 0{K) probabilities, the Gibbs sampler has 0{TK) complexity. 
The Gibbs sampler’s is straightforwardly implemented but often suffers from slow mixing 
behavior since sequential data tends to be strongly correlated. 

In contrast, the traditional approach for efficient inference of the hidden state trajectory 
in the classical, finite-state space HMM uses the forward-backwards algorithm (i.e. belief 
propagation) to recursively infer the hidden state trajectory in 0{TK^) time where T is 
the length of the HMM and K the size of the latent space. It is tempting to hope a 
similar type of algorithm exists for the iHMM, but it is impossible to directly apply such a 
message-passing approach due to the countably infinite state-space (i.e. K is unbounded). 
However, Van Gael et al. (2008) circumvented this difficulty in the iHMM by introducing 
of a set of auxiliary slice variable ui-t into the model; when conditioned on ui:T the model 
becomes finite. In contrast to the original Gibbs sampling routine, the beam sampler 
iteratively resamples auxiliary slice variables u, the trajectory s, transition matrix vr, shared 
DP measure (3, and hyper-parameters a, 7 conditioned on all other variables. This allowed 
Van Gael et al. (2008) to use dynamic programming to jointly resample the latent states. In 
practice, they found that their sampler mixed much faster than the naive Gibbs sampler and 
had average-case complexity closer to 0{TK) for sparse transition matrices, but worst-case 
complexity 0{TK‘^) (Van Gael et al., 2008). 

Despite the power of the beam-sampling scheme. Fox et al. (2008) found that application 
of the beam sampler to the sticky iHMM, resulted in slow mixing. As noted in Fox et al. 
(2008), for the beam sampler to consider transitions with low prior probability one must 
also have sampled an unlikely corresponding slice variable so as not to have truncated that 
state out of the space. Such a situation becomes increasingly problematic if one must 
make several of these low-probability moves, independently of whether there is strong data- 
dependent likelihood favoring the transition. This problem was avoided in Fox et al. 
(2008) by considering a fixed-order truncation of the HDP-HMM and designing a blocked 
Gibbs sampler to resample the finite set of latent states at the cost of introducing bias into 
the inference. Although the truncation affords the possibility of exploring the full set of 
paths unhindered by the slice variables, one must balance the trade-off between the bias and 
computational cost of the blocked sampler on the truncated model - 0{TK‘^) where K must 
be taken to be large to obtain small bias. This more complex variant of the iHMM coupled 
with a Dirichlet Process (DP) emission distribution achieved state-of-the-art performance 
on a particularly challenging speaking diarization task. Notably, the “stickiness” helped 
eliminate the undesirable fast-transition behaviour characteristic of the HDP-HMM^, and 
the DP emission model helped capture the complex, multimodal nature of the data. Indeed, 
it is worth noting that although the beam sampler can be applied to the sticky iHMM, it 
cannot be applied to the sticky iHMM with a nonparametric DP emission model. In fact, 
no exact sampler has been previously constructed for this model. 

Our resampling scheme for vr, </>, /3, a, 7 , and k will follow straightforwardly from this 
scheme in (Van Gael et al., 2008), (Fox et ah, 2008) and (Teh et ah, 2006). We refer the 
reader to these works for the details on the resampling of a, k and 7 since we use exactly 
the constructions presented there, but present a brief overview of how to sample vr, cj), and 
f3. 

3. This only requires the introduction of a single extra hyper-parameter 
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For simplicity we will review the case of the normal iHMM where n = 0 since the 
introduction of n involves more bookkeeping but does not modify the core scheme. Let 
denote the number of times state i transitions to state j in the trajectory si:T, and K be 
the number of distinct states in si-t- Merging the inhnitely many states not represented 
in s into one state, the conditional distribution of (7r(l|si),7r(iL|si), 7r(sj_,_;^|st)) 

given its Markov blanket s, /3, and a is 

OO 

Vir{nsi^i + al3i,...,ns^^K + a(3K,a ^ (3i) 

i = K+l 

To sample (3 we first introduce a set of auxiliary variables rrijk which can be interpreted 
as the number of times parameter (f)^ has been sampled in tTj (sometimes these parameters 
are referred to as dishes in the Chinese Restaurant Franchise analogy). These parameters 
have conditional distributions equal to pipijk = m|s,/3,a, k) oc S { nij , m ){ al 3 j )^ where 
denote the Stirling numbers of the first kind. As Teh et al. (2006) and Antoniak 
(1974) show this gives the conditional distribution over /3 as T>ir{m.k, where 

m.k = Ylk'=i Conditional on all other variables the are independent of each other 

and can be easily sampled efficiently when the base distribution H is conjugate to the data 
distribution F, though the assumption of conjugacy is not necessary. 
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